Machine Learning: Mathematical Theory and Applications
Computer lab 4
Group 12 - Hugo Xinghe CHEN & Jianhan WANG
Published
October 30, 2024
Warning
The instructions in this lab assume that the following package is installed:
caret
Packages may be installed by executing install.packages('packagename'), where 'packagename' is the name of the package, e.g. 'splines'. You may have to install additional packages to solve the computer lab. If you want this file to run locally on your computer, you have to change some paths to where the data are stored (see below).
Introduction
This computer lab treats topics such as supervised learning, unsupervised learning, and semi-supervised learning.
Instructions
In this computer lab, you will work in groups of two. The group hands in a single set of solutions. It is strictly forbidden to copy code or solutions from other groups, as is the use of AI tools (such as ChatGPT). Not obeying these instructions is regarded as academic misconduct and may have serious consequences for you.
This computer lab is worth a total of 15 marks (the subject has a maximum of 100 marks). The maximum number of marks for a given problem is shown in parentheses at the top of the section. The problems you should solve are marked with the symbol 💪 and surrounded by a box. Hand in the solutions and programming code in the form of an html document generated by Quarto. Before submitting, carefully check that the document compiles without any errors. Use properly formatted figures and programming code.
Warning
Not submitting according to the instructions may result in loss of marks. The same applies to poorly formatted reports (including poorly formatted code/figures, unnecessarily long outputs, etc.).
1. Linear discriminant analysis for cancer data (2 marks)
In this problem, we consider a generative classification model for breast cancer data, where features computed from 569 digitised images are modelled and subsequently used to classify the tumor type (malign or benign). The data is stored in the file cancer_data_10features.Rdata, which can be downloaded from the Canvas page of the course.
We will consider the two features radius (mean of distances from the center to points on the perimeter) and texture (standard deviation of gray-scale values). Your task in this problem is to implement a linear discriminant analysis method without using packages (such as the lda() function from the MASS package).
The following code reads in the data, creates the training and test datasets, and visualises these data for the classification problem.
set.seed(1234)
suppressMessages(library(caret))
load(file = '/Users/chenxinghe/Desktop/Hugo/ESILV/A5/S9_UTS/37401_ML/ComputerLab/4/cancer_data_10features.RData')
train_obs <- createDataPartition(y = cancer_data_10features$diagnosis, p = .80, list = FALSE)
train <- cancer_data_10features[train_obs, 1:3]
test <- cancer_data_10features[-train_obs, 1:3]
# Confirm both training and test are balanced with respect to diagnosis (malign tumor)
cat("Percentage of training data consisting of malign tumors:", 100*mean(train$diagnosis == "M"))
Percentage of training data consisting of malign tumors: 37.2807
cat("Percentage of test data consisting of malign tumors:", 100*mean(test$diagnosis =="M"))
Percentage of test data consisting of malign tumors: 37.16814
# Train and test for each tumor class
ind_train_M <- train$diagnosis == "M"
ind_test_M <- test$diagnosis == "M"
plot(train[ind_train_M, 2], train[ind_train_M, 3], xlab = "Radius", ylab = "Texture",
     col = "cornflowerblue", xlim = c(5, 30), ylim = c(5, 40))
lines(train[!ind_train_M, 2], train[!ind_train_M, 3], col = "lightcoral", type = "p")
lines(test[ind_test_M, 2], test[ind_test_M, 3], col = "cornflowerblue", type = "p", pch = "x")
lines(test[!ind_test_M, 2], test[!ind_test_M, 3], col = "lightcoral", type = "p", pch = "x")
legend("topright", legend = c("Malign train", "Benign train", "Malign test", "Benign test"),
       col = c("cornflowerblue", "lightcoral", "cornflowerblue", "lightcoral"), pch = c("o", "o", "x", "x"))
Assume that the class conditionals \(p(\mathbf{x}|y)\), for \(y\in\{1, 2\}\) (1-malign, 2-benign), follow a bivariate normal distribution with different means but a common covariance matrix, i.e. \[\mathbf{x}|y=m \sim \mathcal{N}(\boldsymbol{\mu}_m,\boldsymbol{\Sigma}),\,\,m=1,2.\]
Let \(\pi_m=\Pr(y=m)\) for \(m=1,2\). These \(\pi_m\) probabilities can be interpreted as the marginal (marginalising out \(\mathbf{x}\) through \(\Pr(y=m)=\int p(\mathbf{x}, y=m)\mathbf{d}\mathbf{x}\)) class membership probabilities, i.e. unconditional on \(\mathbf{x}\). Given an \(\mathbf{x}=(x_1, x_2)^\top\), we want to compute its class membership probability. From Bayes’ theorem, it follows that the conditional distribution of the class membership is \[\Pr(y=m|\mathbf{x})=\frac{p(\mathbf{x}|y=m)\Pr(y=m)}{p(\mathbf{x})}\propto p(\mathbf{x}|y=m)\Pr(y=m),\,\,m=1,2.\]
The denominator above is the marginal density of \(\mathbf{x}\) (marginalising out \(y\) through \(p(\mathbf{x})=\sum_m p(\mathbf{x}, y=m)\)), which normalises \(\Pr(y=m|\mathbf{x})\) so that \(\sum_m\Pr(y=m|\mathbf{x})=1\).
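Writing out this marginalisation for the model above (a sketch of the step needed in Problem 1.1) gives a two-component Gaussian mixture: \[p(\mathbf{x})=\sum_{m=1}^{2}p(\mathbf{x}|y=m)\Pr(y=m)=\pi_1\,\mathcal{N}(\mathbf{x};\boldsymbol{\mu}_1,\boldsymbol{\Sigma})+\pi_2\,\mathcal{N}(\mathbf{x};\boldsymbol{\mu}_2,\boldsymbol{\Sigma}),\] where \(\mathcal{N}(\mathbf{x};\boldsymbol{\mu},\boldsymbol{\Sigma})\) denotes the bivariate normal density with mean \(\boldsymbol{\mu}\) and covariance \(\boldsymbol{\Sigma}\).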
💪 Problem 1.1
Derive (analytically) \(p(\mathbf{x})\) for the example above.
💪 Problem 1.2
Given the training data, estimate the parameters \(\pi_m,\boldsymbol{\mu}_m\) for \(m=1,2\), and \(\boldsymbol{\Sigma}\). Predict the labels of the test data and plot them in a scatter plot with different colors to represent the different classes.
Tip
Let \(\mathbf{x}_\star=(x_{\star1}, x_{\star2})^\top\) be a point we want to predict \(y\) for and let \(\widehat{y}(\mathbf{x}_\star)\) denote the prediction. The prediction is obtained as \[\widehat{y}(\mathbf{x}_\star)= \arg \max_m \Pr(y=m|\mathbf{x}_\star)=\arg \max_m \log \Pr(y=m|\mathbf{x}_\star).\] It can be shown that for the model above, \[\widehat{y}(\mathbf{x}_\star)=\arg\max_m \left\{ \log p(\mathbf{x}_\star|y=m) + \log\pi_m \right\}=\arg\max_m \left\{ \mathbf{x}_{\star}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_{m} - \frac{1}{2}\boldsymbol{\mu}_{m}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_{m}+\log \pi_{m}\right\}.\]
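A minimal sketch of how this rule could be implemented without packages (function and variable names are our own; the common \(\boldsymbol{\Sigma}\) is estimated with the standard pooled within-class covariance):

```r
# Sketch: LDA estimation and prediction for a numeric feature matrix X and labels y
lda_fit <- function(X, y) {
  classes <- sort(unique(y))
  pis <- sapply(classes, function(m) mean(y == m))                # class priors pi_m
  mus <- lapply(classes, function(m) colMeans(X[y == m, , drop = FALSE]))  # class means mu_m
  # Pooled within-class covariance: one Sigma shared by all classes
  n <- nrow(X)
  Sigma <- Reduce(`+`, lapply(seq_along(classes), function(k) {
    Xc <- scale(X[y == classes[k], , drop = FALSE], center = mus[[k]], scale = FALSE)
    t(Xc) %*% Xc
  })) / (n - length(classes))
  list(classes = classes, pis = pis, mus = mus, Sigma = Sigma)
}

lda_predict <- function(fit, Xstar) {
  # Xstar: matrix of points (one row per point)
  Sinv <- solve(fit$Sigma)
  # Linear discriminant for each class m: x' Sinv mu_m - mu_m' Sinv mu_m / 2 + log pi_m
  d <- sapply(seq_along(fit$classes), function(k) {
    mu <- fit$mus[[k]]
    drop(as.matrix(Xstar) %*% Sinv %*% mu) - 0.5 * drop(t(mu) %*% Sinv %*% mu) + log(fit$pis[k])
  })
  fit$classes[max.col(d)]  # class with the largest discriminant per row
}
```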
💪 Problem 1.3
Plot the decision boundary of the classifier and the predictions of the test data from Problem 1.2 in the same plot.
Tip
To determine the decision boundary, we can solve the equation \(\log\Pr(y=1|\mathbf{x}_\star)=\log\Pr(y=2|\mathbf{x}_\star)\) for \(\mathbf{x}_\star=(x_{\star1}, x_{\star2})^\top\). This gives \[\mathbf{x}_{\star}^{\top}\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_{1} - \boldsymbol{\mu}_{2})- \frac{1}{2}\boldsymbol{\mu}_{1}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_{1} + \frac{1}{2}\boldsymbol{\mu}_{2}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_{2}+\log \pi_{1} - \log \pi_{2}=0.\]
After some algebra, we can express the above as \(x_{\star2}=\gamma_0 + \gamma_1x_{\star1}\) for some \(\gamma_0,\gamma_1\) that can be evaluated using the estimated \(\pi_m,\boldsymbol{\mu}_m\), and \(\boldsymbol{\Sigma}\). We can now construct a grid for \(x_{\star1}\) (e.g. x1_grid <- seq(5, 30, length.out = 100)) and use \(x_{\star2}=\gamma_0 + \gamma_1x_{\star1}\) to plot the decision boundary.
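The algebra in the tip can be packaged into a small helper that returns \(\gamma_0,\gamma_1\) (a sketch; the arguments stand for the estimates from Problem 1.2 and the names are our own):

```r
# Sketch: boundary coefficients gamma0, gamma1 from estimated LDA parameters.
# Writing the boundary equation as a1*x1 + a2*x2 + c0 = 0 with a = Sigma^{-1}(mu1 - mu2)
# gives x2 = gamma0 + gamma1*x1 with gamma0 = -c0/a2 and gamma1 = -a1/a2.
lda_boundary <- function(mu1, mu2, Sigma, pi1, pi2) {
  Sinv <- solve(Sigma)
  a <- Sinv %*% (mu1 - mu2)
  c0 <- -0.5 * drop(t(mu1) %*% Sinv %*% mu1) +
         0.5 * drop(t(mu2) %*% Sinv %*% mu2) + log(pi1) - log(pi2)
  c(gamma0 = -c0 / a[2], gamma1 = -a[1] / a[2])
}

x1_grid <- seq(5, 30, length.out = 100)
# With the estimates from Problem 1.2 (hypothetical names):
# gam <- lda_boundary(mu1_hat, mu2_hat, Sigma_hat, pi1_hat, pi2_hat)
# x2_boundary <- gam["gamma0"] + gam["gamma1"] * x1_grid
```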
💪 Problem 1.4
Fit a logistic regression to the training data using the glm() function. Compare the results to the generative model in Problem 1.2. Comment on the results.
Tip
By now you should know many metrics to evaluate a binary classifier.
train$diagnosis_binary <- ifelse(train$diagnosis == "M", 1, 0)
logistic_model <- glm(diagnosis_binary ~ radius + texture, data = train, family = binomial)
test$logistic_prob <- predict(logistic_model, newdata = test, type = "response")
# Classify based on 0.5 threshold
test$logistic_prediction <- ifelse(test$logistic_prob > 0.5, "M", "B")
test$logistic_prediction <- as.factor(test$logistic_prediction)
plot(test[, 2], test[, 3], col = ifelse(test$logistic_prediction == "M", "red", "black"),
     main = "Scatter Plot of Logistic Regression Predicted Classes with LDA Boundary",
     xlab = "Radius", ylab = "Texture",
     xlim = range(test[, 2], na.rm = TRUE), ylim = range(test[, 3], na.rm = TRUE))
legend("topleft", legend = c("Predicted Malign", "Predicted Benign"), col = c("red", "black"), pch = 1)
# Add LDA decision boundary to logistic regression plot
lines(x1_grid, x2_boundary, col = "blue", lwd = 2)
legend("topright", legend = c("LDA Decision Boundary"), col = "blue", lwd = 2)
The plot shows that LDA and logistic regression disagree on only a few test points. The two classifiers have the same accuracy, but LDA achieves higher precision at the cost of noticeably lower recall and specificity.
2. Quadratic discriminant analysis for cancer data (3 marks)
A more flexible model allows a different covariance matrix for each class conditional, i.e. \[\mathbf{x}|y=m \sim \mathcal{N}(\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m).\] It can be shown that \[\widehat{y}(\mathbf{x}_{\star})=\arg\max_m\left(-\frac{1}{2}\log\left|\boldsymbol{\Sigma}_{m}\right|-\frac{1}{2}(\mathbf{x}_{\star}-\boldsymbol{\mu}_{m})^{\top}\boldsymbol{\Sigma}_{m}^{-1}(\mathbf{x}_{\star}-\boldsymbol{\mu}_{m})+\log\pi_{m}\right),\] where \(\left|\cdot\right|\) denotes the determinant. We note that \(\log\Pr(y=m|\mathbf{x}_\star)\) is quadratic in \(\mathbf{x}_\star\), and hence the decision boundary of this classifier is not linear.
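The quadratic discriminant above translates almost line by line into code. A sketch for a single point and a single class (names are our own; the predicted class is the \(m\) maximising this quantity):

```r
# Sketch: QDA discriminant delta_m(xstar) = -log|Sigma_m|/2 - (x-mu)' Sigma_m^{-1} (x-mu)/2 + log pi_m
qda_discriminant <- function(xstar, mu, Sigma, pi_m) {
  d <- xstar - mu
  -0.5 * log(det(Sigma)) - 0.5 * drop(t(d) %*% solve(Sigma) %*% d) + log(pi_m)
}

# Predicted class for a point: which.max over m of
# qda_discriminant(xstar, mu_hat[[m]], Sigma_hat[[m]], pi_hat[m])
```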
💪 Problem 2.1
Derive (analytically) \(p(\mathbf{x})\) with the new class conditional distribution above.
💪 Problem 2.2
Given the training data, estimate the parameters \(\pi_m,\boldsymbol{\mu}_m\), and \(\boldsymbol{\Sigma}_m\) for \(m=1,2\). Predict the labels of the test data and plot them in a scatter plot with different colors to represent the different classes.
Tip
The log of the determinant of a matrix \(A\) can be computed as log(det(A)).
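As a side note, for larger covariance matrices the log-determinant can also be computed directly (and more stably) with base R's determinant(), which returns the logarithm by default:

```r
A <- matrix(c(2, 0.5, 0.5, 1), nrow = 2)  # an example positive-definite matrix
ld <- as.numeric(determinant(A, logarithm = TRUE)$modulus)
# ld equals log(det(A)) up to floating-point error, without forming det(A) first
```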
Here, the QDA model mirrors the performance of the logistic regression model; in this case, both outperform the LDA model thanks to higher recall and specificity.
To further separate the performance of QDA and logistic regression, we carry out a cross-validation.
The cross-validation shows that logistic regression performs best overall.
💪 Problem 2.4
A doctor contacts you and says she has a patient whose digitised image has the features radius=13.20 and texture=19.22. Use your best classifier from Problem 2.3 to provide the doctor with some advice.
patient_data <- data.frame(radius = 13.20, texture = 19.22)
patient_prob <- predict(logistic_model, newdata = patient_data, type = "response")
patient_prediction <- ifelse(patient_prob > 0.5, "Malign", "Benign")
cat("Prediction for the patient with radius =", patient_data$radius, "and texture =", patient_data$texture, ":\n")
Prediction for the patient with radius = 13.2 and texture = 19.22 :
cat("Predicted class:", patient_prediction, "\n")
Predicted class: Benign
cat("Probability of malignancy:", round(patient_prob, 3), "\n")
Probability of malignancy: 0.156
We would advise the doctor that this patient's tumor is probably benign, but that there remains an estimated 15.6% probability of malignancy.
3. Unsupervised Gaussian mixture models (2 marks)
Both Problems 1 and 2 above are so-called supervised learning methods because the class labels are observed (we know the status of the tumor). In many problems, we do not observe the classes and we might not even know whether the classes are interpretable as in the example above (where the classes are interpreted as the severity of the tumor). Let us consider two examples to further illustrate what we mean by interpretability.
The first example concerns a dataset that contains measurements of the lengths (in mm) of 1000 insects of a certain species. The data is stored in the file insects.Rdata, which can be downloaded from the Canvas page of the course. The following code reads in and visualises the data.
load(file = '/Users/chenxinghe/Desktop/Hugo/ESILV/A5/S9_UTS/37401_ML/ComputerLab/4/insects.RData')
hist(insects$length, xlab = "Length (in mm)", ylab = "Counts", col = "cornflowerblue",
     main = "Histogram of the lengths of insects")
We see that there seem to be two groups of insects, the first group with lengths of about \(\approx1\text{-}6\)mm and the second group with lengths of about \(\approx7\text{-}11\)mm. Note that there are no other variables available in the dataset, but a plausible explanation for the data clustering this way is a second variable — sex. This is what we mean by cluster interpretability.
The next example concerns the relationship between the closing price and the trading volume (in log-scale) for the Associated Banc-Corp (ASB) stock that is traded in the New York Stock Exchange (NYSE). The data are on a daily frequency and cover the period 2014-12-26 to 2017-11-10. The data are stored in the file asb.Rdata, which can be downloaded from the Canvas page of the course. The following code reads in and visualises the data.
load(file = '/Users/chenxinghe/Desktop/Hugo/ESILV/A5/S9_UTS/37401_ML/ComputerLab/4/asb.RData')
plot(Close ~ log(Volume), data = asb, col = "cornflowerblue",
     main = "ASB stock: Closing price vs log of trading volume")
The data seem to be divided into two clusters. The following code plots a time series plot of the closing price.
date <- as.Date(asb$Date)
Close <- asb$Close
plot(date, Close, col = "cornflowerblue", type = "l",
     main = "ASB stock: Closing prices during 2014-12-26 to 2017-11-10")
💪 Problem 3.1
Give a possible interpretation of the two clusters.
For the ASB stock data, the two clusters could represent different trading regimes or market conditions.
One cluster may correspond to lower trading volumes at lower prices, likely during stable periods or periods of limited interest in the stock.
The other cluster, with higher volumes and prices, may reflect increased investor interest, perhaps during more volatile or news-driven periods.
This interpretation aligns with market sentiment or volatility acting as a hidden variable, analogous to sex in the insects example.
Further insight into the unsupervised Gaussian mixture model can be gained by understanding the data-generating process via simulation. Assume we have only one feature (like the insects example) and two classes. Let \(\Pr(y=m)=\pi_m\) and \(x|y=m\sim\mathcal{N}(\mu_m,\sigma_m^2)\) for \(m=1,2\), with \(\pi_1=0.2\) (so \(\pi_2=0.8\)), \(\mu_1=-2\), \(\sigma_1=0.5\), \(\mu_2=4\), and \(\sigma_2=1.5\).
The following code simulates \(n=1000\) observations from the specified mixture model above. It uses a for-loop for pedagogical reasons. Some explanation of the code follows.
mu1 <- -2
sigma1 <- 0.5
mu2 <- 4
sigma2 <- 1.5
pi1 <- 0.2
pi2 <- 1 - pi1
# Store in vectors:
mu <- c(mu1, mu2)
sigma <- c(sigma1, sigma2)
pis <- c(pi1, pi2)
n <- 1000
y <- rep(NA, n)
x <- rep(NA, n)
for(i in 1:n){
  # Simulate indicator with probability pi2 (1 - component 2, 0 - component 1)
  y[i] <- rbinom(n = 1, size = 1, prob = pis[2]) + 1
  x[i] <- rnorm(n = 1, mean = mu[y[i]], sd = sigma[y[i]])
}
# Sanity check: Confirm the marginal class probabilities are pi1 and pi2 (approx)
cat("Estimated class 1 marginal probability: ", mean(y == 1), ". True: ", pis[1], sep = "")
Estimated class 1 marginal probability: 0.209. True: 0.2
cat("Estimated class 2 marginal probability: ", mean(y == 2), ". True: ", pis[2], sep = "")
Estimated class 2 marginal probability: 0.791. True: 0.8
# Normalised histogram
hist(x, xlab = "x", col = "cornflowerblue", main = "Normalised histogram of the simulated x", prob = TRUE)
Most of the code above is self-explanatory except for the body of the for-loop. The function rbinom() simulates n=1 random number from a binomial distribution with size=1 trials and success probability prob=pis[2]. This is just a Bernoulli variable that takes the values 0 and 1 with probabilities \(\pi_1\) and \(\pi_2\), respectively. We add 1 to make the outcome \(m=1,2\) (instead of \(m=0,1\)). The next line simulates from a normal distribution whose mean and standard deviation depend on the class membership of the observation (stored in y[i]).
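For reference, the same data-generating process can be written without a loop (a sketch with the same parameter values; sample() draws the class labels directly with probabilities \(\pi_1,\pi_2\), and rnorm() is vectorised over the per-observation means and standard deviations):

```r
set.seed(1234)
n <- 1000
mu <- c(-2, 4); sigma <- c(0.5, 1.5); pis <- c(0.2, 0.8)
y <- sample(1:2, size = n, replace = TRUE, prob = pis)  # class memberships in one call
x <- rnorm(n, mean = mu[y], sd = sigma[y])              # component draws, vectorised over y
```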
A (normalised) histogram of the simulated data (x) is an empirical representation of the marginal density \(p(x)\) you derived in Problem 3.2. By comparing the two, we can check that the simulation above is correct (and that the expression you derived for \(p(x)\) is correct!).
💪 Problem 3.3
Plot the \(p(x)\) you derived in Problem 3.2 in the same figure as a (normalised) histogram.
Tip
Create a fine grid of \(x\) values (e.g. x_grid<-seq(-6,10,length.out=1000)) and evaluate \(p(x)\) on the grid. Make sure to set prob=TRUE in the hist() function (so the plots are in the same scale). Note that with a few bins it is hard to evaluate the quality of the approximation. Try setting the argument breaks=30 in the hist() function.
x_grid <- seq(-6, 10, length.out = 1000)
p_x <- 0.2/sqrt(0.5*pi) * exp(-(x_grid + 2)^2/0.5) + 0.8/sqrt(4.5*pi) * exp(-(x_grid - 4)^2/4.5)
hist(x, xlab = "x", col = "cornflowerblue", main = "Normalised histogram of the simulated x with p(x)",
     prob = TRUE, breaks = 30)
# Overlay the density function p(x)
lines(x_grid, p_x, col = "red", lwd = 2)
4. Unsupervised learning via the EM algorithm (5 marks)
The expectation-maximisation (EM) algorithm is a powerful algorithm to learn the parameters in unsupervised learning models. In addition, the EM algorithm gives us the conditional (on the training data) distribution of the class membership for each observation, which we can use for classification. We will first learn how to implement the EM algorithm for the unsupervised Gaussian mixture model for a single feature \(x\) and two classes, which we studied in Problem 3. We stress that the difference compared to Problems 1 and 2 (which also had Gaussian components) is that the labels are not observed in this problem.
The following function implements the EM algorithm. The implementation assumes a single feature and two classes, i.e. \(x\in\mathbb{R}\) and \(M=2\).
EM_GMM_M2 <- function(x, mu_start, sigma_start, pis_start, n_iter = 100) {
  # Estimates the parameters in an unsupervised Gaussian mixture model with M = 2 classes.
  # Runs the EM algorithm for n_iter iterations. x is assumed univariate.
  # mu_start, sigma_start, pis_start are starting values.
  stopifnot(sum(pis_start) == 1)
  # Quantities to save
  pis_all <- matrix(NA, nrow = n_iter, ncol = 2)
  mu_all <- matrix(NA, nrow = n_iter, ncol = 2)
  sigma_all <- matrix(NA, nrow = n_iter, ncol = 2)
  Q_all <- rep(NA, n_iter)
  log_like_all <- rep(NA, n_iter)
  # Initialise
  mu <- mu_start
  sigma <- sigma_start
  pis <- pis_start
  n <- length(x)
  W <- matrix(0, nrow = n, ncol = 2) # Unnormalised weights for each observation
  log_pdf_class <- matrix(0, nrow = n, ncol = 2) # The log-likelihood of the two classes for each obs. To compute Q.
  for(j in 1:n_iter){
    # Start EM steps
    # E-step: Compute the expected log-likelihood Q
    for(m in 1:2){
      # The log-density for each class
      log_pdf_class[, m] <- dnorm(x, mean = mu[m], sd = sigma[m], log = TRUE) + log(pis[m])
      # Unnormalised weights
      W[, m] <- pis[m]*dnorm(x, mean = mu[m], sd = sigma[m])
    }
    w <- W/rowSums(W) # Normalise weights
    n_hat <- colSums(w) # Expected number of obs per class
    Q <- sum(rowSums(w*log_pdf_class)) # Expected log-likelihood
    # M-step: Maximise Q. Closed form analytical solution in Gaussian mixture models
    for(m in 1:2){
      pis[m] <- n_hat[m]/n
      mu[m] <- 1/n_hat[m]*sum(w[, m]*x)
      sigma[m] <- sqrt(1/n_hat[m]*sum(w[, m]*(x - mu[m])^2))
    }
    # End EM steps. Save estimates, Q, and log-likelihood
    pis_all[j, ] <- pis
    mu_all[j, ] <- mu
    sigma_all[j, ] <- sigma
    Q_all[j] <- Q
    # Compute log-likelihood at current parameter values
    for(m in 1:2){
      # Unnormalised weights
      W[, m] <- pis[m]*dnorm(x, mean = mu[m], sd = sigma[m])
    }
    log_like_all[j] <- sum(log(rowSums(W)))
  } # End EM iterations
  # Return everything as a list
  return(list(pi_hat = pis, mu_hat = mu, sigma_hat = sigma, weights = W/rowSums(W),
              pis_all = pis_all, mu_all = mu_all, sigma_all = sigma_all,
              Q_all = Q_all, log_like_all = log_like_all))
}
The log-likelihood is guaranteed to not decrease at any iteration, which makes the variable log_like_all above useful for debugging. The following code uses the function above to estimate the parameters using our simulated data from Problem 3, performs some convergence checks, and plots the class posterior probability distribution for an observation.
matplot(0:n_iter, rbind(mu_start, EM_result$mu_all), main = 'mu', pch = c("o", "o"),
        col = c("cornflowerblue", "lightcoral"), xlab = "Iteration", ylab = "mu", ylim = c(-3, 6))
legend("topright", legend = c("Class 1", "Class 2"), col = c("cornflowerblue", "lightcoral"), pch = c("o", "o"))
matplot(0:n_iter, rbind(sigma_start, EM_result$sigma_all), main = 'sigma', pch = c("o", "o"),
        col = c("cornflowerblue", "lightcoral"), xlab = "Iteration", ylab = "sigma", ylim = c(0, 4))
legend("topright", legend = c("Class 1", "Class 2"), col = c("cornflowerblue", "lightcoral"), pch = c("o", "o"))
par(mfrow = c(1, 1))
# Inspect convergence
plot(EM_result$log_like_all, main = 'Log-likelihood', type = "l", col = "cornflowerblue",
     xlab = "Iteration", ylab = "Log-likelihood")
plot(EM_result$Q_all, main = 'Expected log-likelihood (Q)', type = "l", col = "cornflowerblue",
     xlab = "Iteration", ylab = "Log-likelihood")
print("True parameters (pi, mu, and sigma)")
[1] "True parameters (pi, mu, and sigma)"
print(pis)
[1] 0.2 0.8
print(mu)
[1] -2 4
print(sigma)
[1] 0.5 1.5
print("Estimated parameters (pi, mu, and sigma)")
[1] "Estimated parameters (pi, mu, and sigma)"
print(EM_result$pi_hat)
[1] 0.2086633 0.7913367
print(EM_result$mu_hat)
[1] -2.096384 3.983013
print(EM_result$sigma_hat)
[1] 0.479649 1.465332
# Inspect the classification probabilities of observation 10
ind <- 10
barplot(EM_result$weights[ind, ], names.arg = c("Class 1", "Class 2"), col = "cornflowerblue",
        ylim = c(0, 1), main = paste("Class (posterior) probability observation ", ind, sep = ""))
💪 Problem 4.1
Explain the label-switching problem when estimating unsupervised Gaussian mixture models. Do you observe label-switching above?
In unsupervised Gaussian mixture models, label-switching occurs because the components carry no intrinsic labels: the likelihood is invariant to permuting the components, so the parameters labelled "Class 1" and "Class 2" can swap between runs or during the iterations.
This does not affect the clustering itself, only which set of parameters is attached to which label.
Here we see no sign of label-switching: the EM algorithm converges stably, each component keeps a distinct set of parameter values throughout the iterations, and the estimates match the true parameters in the original ordering.
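If label-switching were a concern (e.g. when comparing runs started from different values), a simple post-hoc fix is to reorder the components by their means. A sketch (the argument names mirror the output of EM_GMM_M2(); this is our own helper, not part of the lab code):

```r
# Sketch: resolve label-switching by sorting components by their estimated means
relabel_by_mean <- function(mu_hat, sigma_hat, pi_hat) {
  ord <- order(mu_hat)  # permutation that sorts the component means
  list(mu_hat = mu_hat[ord], sigma_hat = sigma_hat[ord], pi_hat = pi_hat[ord])
}
# Usage (hypothetical): relabel_by_mean(EM_result$mu_hat, EM_result$sigma_hat, EM_result$pi_hat)
```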
The next problem uses the EM algorithm to estimate an unsupervised Gaussian mixture model for the insects data. In particular, the model will be used to classify three insects: observations 6, 244, and 421 in insects.Rdata. The following code plots a (normalised) histogram of the data and highlights the feature values of these three observations.
hist(insects$length, col = "cornflowerblue", main = "Histogram of insects' lengths", prob = TRUE,
     xlab = "Length", ylim = c(0, 0.4), xlim = c(0, 14))
abline(v = insects[6, ], lwd = 1.5, col = "lightcoral")
abline(v = insects[244, ], lwd = 1.5, col = "purple")
abline(v = insects[421, ], lwd = 1.5, col = "lightpink")
legend("topright", legend = c("Obs 6", "Obs 244", "Obs 421"),
       col = c("lightcoral", "purple", "lightpink"), lwd = c(1.5, 1.5, 1.5))
💪 Problem 4.2
Use the EM_GMM_M2() function to estimate an unsupervised Gaussian mixture model for the insect data in Problem 3. Analyse the convergence. Compare the parameter estimates to those obtained by the function normalmixEM() in the mixtools package.
Tip
The EM algorithm is sensitive to starting values. Try a few different starting values and pick the solution that gives you the highest log-likelihood value. You might have to increase the number of iterations if you have not achieved convergence yet. Finally, recall that the log-likelihood is guaranteed to not decrease at each iteration.
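One way to organise the multiple-starts strategy from the tip is a small helper that runs the EM from each starting value and keeps the fit with the highest final log-likelihood. A sketch (fit_fun is a hypothetical wrapper around a call such as EM_GMM_M2(insects$length, ...); the helper only requires that it returns a log_like_all vector):

```r
# Sketch: run a fitting function from several starts, keep the best run
best_of_starts <- function(fit_fun, starts) {
  fits <- lapply(starts, fit_fun)                                # one EM run per starting value
  final_ll <- sapply(fits, function(f) tail(f$log_like_all, 1))  # final log-likelihood of each run
  fits[[which.max(final_ll)]]                                    # return the best run
}

# Usage (hypothetical):
# starts <- list(c(3, 9), c(2, 7), c(5, 10))
# best <- best_of_starts(function(m) EM_GMM_M2(insects$length, m, c(1, 1), c(0.5, 0.5)), starts)
```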
suppressMessages(library(mixtools))
mu_start <- c(4, 8) # The 2 peaks in the insect histogram
sigma_start <- c(1, 1)
pis_start <- c(0.5, 0.5)
n_iter <- 25
EM_result <- EM_GMM_M2(insects$length, mu_start, sigma_start, pis_start, n_iter = n_iter)
plot(EM_result$log_like_all, main = 'Log-likelihood in EM_GMM_M2', type = "l",
     col = "cornflowerblue", xlab = "Iteration", ylab = "Log-likelihood")
mixtools_results <-normalmixEM(insects$length, k =2)
number of iterations= 30
plot(mixtools_results$all.loglik, main = "Log-Likelihood in normalmixEM", type = "l",
     col = "cornflowerblue", xlab = "Iteration", ylab = "Log-likelihood")
cat("EM_GMM_M2 parameters (pi, mu, and sigma)\n")
EM_GMM_M2 parameters (pi, mu, and sigma)
cat(EM_result$pi_hat, "\n")
0.7935567 0.2064433
cat(EM_result$mu_hat, "\n")
3.994985 8.103937
cat(EM_result$sigma_hat, "\n\n")
0.9905863 0.6674929
cat("normalmixEM parameters (pi, mu, and sigma)\n")
We can see that the EM_GMM_M2() function converges more quickly than the normalmixEM() function.
Both converge to almost the same parameter values and the same log-likelihood.
💪 Problem 4.3
Plot the class posterior probabilities for insects 6, 244, and 421 with the model estimated in Problem 4.2. Are the results as expected? Explain.
Tip
The function EM_GMM_M2() returns an \(n\times2\) matrix weights, which contains the class posterior probabilities evaluated at the parameter estimates from the EM algorithm.
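A sketch of how the three posteriors could be plotted side by side from that weights matrix (plot_posteriors is our own helper name, not part of the lab code):

```r
# Sketch: grouped barplot of class posteriors for selected observations
plot_posteriors <- function(weights, obs) {
  # One group of M bars per selected observation; rows of t(weights) are classes
  barplot(t(weights[obs, , drop = FALSE]), beside = TRUE,
          names.arg = paste("Obs", obs),
          legend.text = paste("Class", seq_len(ncol(weights))),
          ylim = c(0, 1), main = "Class posterior probabilities")
}

# Usage (with the EM output from Problem 4.2):
# plot_posteriors(EM_result$weights, c(6, 244, 421))
```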
Since the dataset is unlabelled, we assess the estimates by checking them against the histogram.
The results are as expected. According to the histogram shown just after Problem 4.1, insects 6 and 421 should not belong to the same class, which matches our estimates. Moreover, insect 244 lies closer to the cluster centred around length 8 (the one containing insect 421), and the model indeed assigns it to the same class as insect 421 with higher posterior probability.
💪 Problem 4.4
Write a function that implements the EM algorithm for the unsupervised Gaussian mixture model with any number of components \(M\), i.e. not limited to \(M=2\) as the EM_GMM_M2() function. You can still assume that \(x\) is univariate. Use your function to estimate two models for the dataset fish.Rdata with, respectively, \(M=3\) and \(M=4\) classes. The dataset can be downloaded from the Canvas page of the course. The dataset contains the lengths of 523 fishes.
Tip
Choose suitable starting values for the means by considering a histogram with 30 breaks of the data. Make sure you run enough iterations by monitoring the convergence via the log-likelihood and the expected log-likelihood.
load(file = '/Users/chenxinghe/Desktop/Hugo/ESILV/A5/S9_UTS/37401_ML/ComputerLab/4/fish.RData')
hist(fish$length, col = "cornflowerblue", main = "Histogram of fish lengths", breaks = 30,
     prob = TRUE, xlab = "Length", ylim = c(0, 0.08), xlim = c(15, 70))
EM_GMM_M <- function(x, mu_start, sigma_start, pis_start, M, n_iter = 100) {
  # Estimates the parameters in an unsupervised Gaussian mixture model with M classes.
  # Runs the EM algorithm for n_iter iterations. x is assumed univariate.
  stopifnot(sum(pis_start) == 1)
  # Quantities to save
  pis_all <- matrix(NA, nrow = n_iter, ncol = M)
  mu_all <- matrix(NA, nrow = n_iter, ncol = M)
  sigma_all <- matrix(NA, nrow = n_iter, ncol = M)
  Q_all <- rep(NA, n_iter)
  log_like_all <- rep(NA, n_iter)
  # Initialise
  mu <- mu_start
  sigma <- sigma_start
  pis <- pis_start
  n <- length(x)
  W <- matrix(0, nrow = n, ncol = M) # Unnormalised weights for each observation
  log_pdf_class <- matrix(0, nrow = n, ncol = M) # The log-likelihood of the classes for each obs. To compute Q.
  for(j in 1:n_iter){
    # E-step: Compute the expected log-likelihood Q
    for(m in 1:M){
      # The log-density for each class
      log_pdf_class[, m] <- dnorm(x, mean = mu[m], sd = sigma[m], log = TRUE) + log(pis[m])
      # Unnormalised weights
      W[, m] <- pis[m]*dnorm(x, mean = mu[m], sd = sigma[m])
    }
    w <- W/rowSums(W) # Normalise weights
    n_hat <- colSums(w) # Expected number of obs per class
    Q <- sum(rowSums(w*log_pdf_class)) # Expected log-likelihood
    # M-step: Maximise Q. Closed form analytical solution in Gaussian mixture models
    for(m in 1:M){
      pis[m] <- n_hat[m]/n
      mu[m] <- 1/n_hat[m]*sum(w[, m]*x)
      sigma[m] <- sqrt(1/n_hat[m]*sum(w[, m]*(x - mu[m])^2))
    }
    # Save estimates, Q, and log-likelihood
    pis_all[j, ] <- pis
    mu_all[j, ] <- mu
    sigma_all[j, ] <- sigma
    Q_all[j] <- Q
    # Compute log-likelihood at current parameter values
    for(m in 1:M){
      W[, m] <- pis[m]*dnorm(x, mean = mu[m], sd = sigma[m])
    }
    log_like_all[j] <- sum(log(rowSums(W)))
  }
  # Return everything as a list
  return(list(pi_hat = pis, mu_hat = mu, sigma_hat = sigma, weights = W/rowSums(W),
              pis_all = pis_all, mu_all = mu_all, sigma_all = sigma_all,
              Q_all = Q_all, log_like_all = log_like_all))
}

M <- 3
mu_start <- c(20, 35, 60)
sigma_start <- rep(1, M)
pis_start <- rep(1/M, M)
EM_result_3 <- EM_GMM_M(fish$length, mu_start, sigma_start, pis_start, M = 3, n_iter = 100)
plot(EM_result_3$log_like_all, main = 'Log-likelihood with 3 classes', type = "l",
     col = "cornflowerblue", xlab = "Iteration", ylab = "Log-likelihood")
M <- 4
mu_start <- c(20, 35, 50, 70)
sigma_start <- rep(1, M)
pis_start <- rep(1/M, M)
EM_result_4 <- EM_GMM_M(fish$length, mu_start, sigma_start, pis_start, M = 4, n_iter = 100)
plot(EM_result_4$log_like_all, main = 'Log-likelihood with 4 classes', type = "l",
     col = "cornflowerblue", xlab = "Iteration", ylab = "Log-likelihood")
💪 Problem 4.5
Plot \(p(x)\) (using the estimates from your EM algorithm) for the two models in Problem 4.4 in the same figure as a (normalised) histogram obtained with hist() using the argument breaks=30. Which of the two models seems visually better for modelling the fishes' lengths?
x_grid <- seq(min(fish$length), max(fish$length), length.out = 1000)
# For M = 3
p_x_M3 <- EM_result_3$pi_hat[1] * dnorm(x_grid, mean = EM_result_3$mu_hat[1], sd = EM_result_3$sigma_hat[1]) +
  EM_result_3$pi_hat[2] * dnorm(x_grid, mean = EM_result_3$mu_hat[2], sd = EM_result_3$sigma_hat[2]) +
  EM_result_3$pi_hat[3] * dnorm(x_grid, mean = EM_result_3$mu_hat[3], sd = EM_result_3$sigma_hat[3])
# For M = 4
p_x_M4 <- EM_result_4$pi_hat[1] * dnorm(x_grid, mean = EM_result_4$mu_hat[1], sd = EM_result_4$sigma_hat[1]) +
  EM_result_4$pi_hat[2] * dnorm(x_grid, mean = EM_result_4$mu_hat[2], sd = EM_result_4$sigma_hat[2]) +
  EM_result_4$pi_hat[3] * dnorm(x_grid, mean = EM_result_4$mu_hat[3], sd = EM_result_4$sigma_hat[3]) +
  EM_result_4$pi_hat[4] * dnorm(x_grid, mean = EM_result_4$mu_hat[4], sd = EM_result_4$sigma_hat[4])
hist(fish$length, breaks = 30, col = "white", prob = TRUE,
     main = "Histogram of fish lengths with GMM density estimates", xlab = "Length", ylim = c(0, 0.08))
# Overlay the density estimate for M = 3
lines(x_grid, p_x_M3, col = "red", lwd = 2, lty = 1)
# Overlay the density estimate for M = 4
lines(x_grid, p_x_M4, col = "blue", lwd = 2, lty = 2)
legend("topright", legend = c("M = 3", "M = 4"), col = c("red", "blue"), lty = c(1, 2), lwd = 2)
Visually, the M=4 model seems to fit the fish lengths better. The two fits are quite similar, but the four-class model tracks the histogram more accurately in the range from 40 to 60.
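The per-component sums above can be generalised to any M with a small helper (a sketch; the argument names follow the output of our EM function):

```r
# Sketch: mixture density p(x) = sum_m pi_m * N(x; mu_m, sigma_m^2) for any number of components
gmm_density <- function(x, pis, mus, sigmas) {
  comps <- vapply(seq_along(pis),
                  function(m) pis[m] * dnorm(x, mean = mus[m], sd = sigmas[m]),
                  numeric(length(x)))         # one column per component
  if (length(x) == 1) sum(comps) else rowSums(comps)
}

# Usage (hypothetical):
# p_x_M3 <- gmm_density(x_grid, EM_result_3$pi_hat, EM_result_3$mu_hat, EM_result_3$sigma_hat)
```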
When the features are multivariate, such as in the example with the ASB stock above, the code for the EM algorithm requires some modification. However, the understanding and application of unsupervised Gaussian mixtures is the same as in the univariate case. The next problem uses a package to estimate an unsupervised (multivariate) Gaussian mixture model for the ASB stock application.
💪 Problem 4.6
Use the function mvnormalmixEM() from the mixtools package to estimate an unsupervised (multivariate) Gaussian mixture model with \(M=2\) classes to the ASB stock example with feature vector \(\mathbf{x}=(x_1,x_2)^\top\), with \(x_1=\text{Close}\) and \(x_2=\text{log(Volume)}\). Use all observations as training data. Plot a scatter plot with the predicted classes for the training data (use different colors to represent the different classes).
x1 <- log(asb$Volume)
x2 <- asb$Close
data_matrix <- cbind(x1, x2)
gmm_model <- mvnormalmixEM(data_matrix, k = 2)
number of iterations= 24
predicted_classes <- apply(gmm_model$posterior, 1, which.max)
plot(x1, x2, col = predicted_classes, pch = 1, xlab = "log(Volume)", ylab = "Close",
     main = "Gaussian Mixture Model Classification")
legend("topright", legend = c("Class 1", "Class 2"), col = 1:2, pch = 1)
5. Semi-supervised learning (3 marks)
In Problems 1 and 2 we assumed that all labels were observed, whereas none of them were observed in Problems 3 and 4. Another scenario is that labels are only partially observed, i.e. we know the labels for a subset of the observations. Statistical learning and modelling under this assumption is referred to as semi-supervised learning.
One of the most common applications of semi-supervised learning is in missing data problems. In this problem, we study how accurately semi-supervised learning can estimate parameters compared to a fully supervised learning algorithm. Moreover, we compare the accuracy of predictions that account for missing labels to those that simply discard the observations with missing labels.
We consider the problem of classifying penguins into species. The data are stored in the file palmer_penguins_missing.Rdata, which can be downloaded from the Canvas page of the course3. The dataset contains a set of features for 265 penguins of two different species, Adelie and Gentoo. The species variable is missing for 49 of the penguins. The following code reads in the data and plots the classes and missing labels in a two-dimensional feature space consisting of the penguin’s flipper length (length of the wing) and body mass.
load(file = '/Users/chenxinghe/Desktop/Hugo/ESILV/A5/S9_UTS/37401_ML/ComputerLab/4/palmer_penguins_missing.RData')

# Get indices to the classes
ind_gentoo <- (palmer_penguins_missing$species == "Gentoo")
ind_adelie <- (palmer_penguins_missing$species == "Adelie")
ind_missing <- is.na(palmer_penguins_missing$species)

x1 <- palmer_penguins_missing$flipper_length_mm
x2 <- palmer_penguins_missing$body_mass_g

plot(x1[ind_gentoo], x2[ind_gentoo], type = "p", col = "cornflowerblue",
     xlim = c(170, 250), ylim = c(2500, 7000),
     xlab = "Flipper length", ylab = "Body mass")
lines(x1[ind_adelie], x2[ind_adelie], type = "p", col = "lightcoral")
lines(x1[ind_missing], x2[ind_missing], type = "p", pch = "?", cex = 0.8, col = "black")
legend("topright", legend = c("Gentoo", "Adelie", "Missing"),
       col = c("cornflowerblue", "lightcoral", "black"), pch = c("o", "o", "?"))
For simplicity we use a single feature, the flipper length, in the problems below.
💪 Problem 5.1
Estimate a (fully) supervised Gaussian mixture model for the penguin data with species as labels and a single feature flipper length. Use separate means and variances for the classes.
Tip
A supervised Gaussian mixture assumes all labels are known. We can remove the observations with missing labels by creating a new data frame with no missing values: df_no_missing <- na.omit(palmer_penguins_missing).
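With all labels known, the supervised fit reduces to class-wise sample statistics: each class probability is a sample proportion, and each class mean and variance is computed from that class's observations alone. A sketch on toy data (the values and names here are illustrative, not the penguin measurements):

```r
# Supervised Gaussian mixture: closed-form estimates from fully labelled data.
# Toy data standing in for the penguin flipper lengths (illustrative values).
x <- c(188, 191, 186, 215, 218, 214)
y <- c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo")

classes <- unique(y)
fit <- lapply(classes, function(cl) {
  xc <- x[y == cl]
  list(pi = length(xc) / length(x),       # marginal class probability
       mu = mean(xc),                     # class mean
       sigma2 = mean((xc - mean(xc))^2))  # ML variance (divides by n, not n - 1)
})
names(fit) <- classes
```

Note that the maximum-likelihood variance divides by n; `var()` divides by n - 1, so the two differ slightly for small classes.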
# Estimate the supervised GMM: separate mean and variance per class
df_no_missing <- na.omit(palmer_penguins_missing)
flipper_adelie <- df_no_missing$flipper_length_mm[df_no_missing$species == "Adelie"]
flipper_gentoo <- df_no_missing$flipper_length_mm[df_no_missing$species == "Gentoo"]
mean_adelie <- mean(flipper_adelie)
var_adelie <- var(flipper_adelie)
mean_gentoo <- mean(flipper_gentoo)
var_gentoo <- var(flipper_gentoo)

# Plot the data and the fitted Gaussian distributions
hist(flipper_adelie, breaks = 10, freq = FALSE, col = "lightcoral",
     xlim = c(170, 250), xlab = "Flipper Length (mm)",
     main = "Flipper Length Distribution")
curve(dnorm(x, mean = mean_adelie, sd = sqrt(var_adelie)), add = TRUE, col = "red", lwd = 2)
hist(flipper_gentoo, breaks = 10, freq = FALSE, col = "cornflowerblue", add = TRUE)
curve(dnorm(x, mean = mean_gentoo, sd = sqrt(var_gentoo)), add = TRUE, col = "blue", lwd = 2)
legend("topright", legend = c("Adelie", "Gentoo"), col = c("red", "blue"), lwd = 2)
💪 Problem 5.2
Estimate a semi-supervised Gaussian mixture model for the penguin data with species as labels and a single feature, flipper length. Compare the parameter estimates to those in Problem 5.1. Comment on the result.
Tip
As discussed in the lecture, semi-supervised learning can be cast as a special version of the EM algorithm. Adapt the EM_GMM_M2 function to take an extra argument containing a vector of labels (with NA for unknown labels) and construct the weights accordingly. Note that in semi-supervised learning the log-likelihood of the non-missing observations is monitored (for convergence), which requires modifying log_like_all.
EM_GMM_M2_semi <- function(x, species, mu_start, sigma_start, pis_start, n_iter = 100) {
  stopifnot(sum(pis_start) == 1)
  pis_all <- matrix(NA, nrow = n_iter, ncol = 2)
  mu_all <- matrix(NA, nrow = n_iter, ncol = 2)
  sigma_all <- matrix(NA, nrow = n_iter, ncol = 2)
  Q_all <- rep(NA, n_iter)
  log_like_all <- rep(NA, n_iter)

  # Initialise
  mu <- mu_start
  sigma <- sigma_start
  pis <- pis_start
  n <- length(x)
  W <- matrix(0, nrow = n, ncol = 2)             # Unnormalised weights for each observation
  log_pdf_class <- matrix(0, nrow = n, ncol = 2) # Log-density of the two classes for each obs.

  for (j in 1:n_iter) { # Start EM steps

    # E-step: Compute the expected log-likelihood Q
    for (m in 1:2) {
      # The log-density for each class
      log_pdf_class[, m] <- dnorm(x, mean = mu[m], sd = sigma[m], log = TRUE) + log(pis[m])
      # Unnormalised weights
      W[, m] <- pis[m] * dnorm(x, mean = mu[m], sd = sigma[m])
    }
    w <- W / rowSums(W) # Normalise weights

    # Labelled data: set the weight to 1 for the observed class and 0 for the other.
    # The species labels are "Adelie" (class 1) and "Gentoo" (class 2).
    for (i in 1:n) {
      if (!is.na(species[i])) {
        if (species[i] == "Adelie") {
          w[i, ] <- c(1, 0)
        } else if (species[i] == "Gentoo") {
          w[i, ] <- c(0, 1)
        }
      }
    }
    n_hat <- colSums(w)                  # Expected number of obs per class
    Q <- sum(rowSums(w * log_pdf_class)) # Expected log-likelihood

    # M-step: Maximise Q. Closed-form analytical solution in Gaussian mixture models
    for (m in 1:2) {
      pis[m] <- n_hat[m] / n
      mu[m] <- 1 / n_hat[m] * sum(w[, m] * x)
      sigma[m] <- sqrt(1 / n_hat[m] * sum(w[, m] * (x - mu[m])^2))
    }

    # Save estimates, Q, and log-likelihood
    pis_all[j, ] <- pis
    mu_all[j, ] <- mu
    sigma_all[j, ] <- sigma
    Q_all[j] <- Q

    # Compute log-likelihood at current parameter values
    for (m in 1:2) {
      W[, m] <- pis[m] * dnorm(x, mean = mu[m], sd = sigma[m])
    }
    log_like_all[j] <- sum(log(rowSums(W)))
  } # End EM iterations

  # Return everything as a list
  return(list(pi_hat = pis, mu_hat = mu, sigma_hat = sigma,
              weights = W / rowSums(W), pis_all = pis_all,
              mu_all = mu_all, sigma_all = sigma_all, Q_all = Q_all,
              log_like_all = log_like_all))
}

mu_start <- c(190, 215)
sigma_start <- c(5, 5)
pis_start <- c(0.5, 0.5)
n_iter <- 100
EM_result <- EM_GMM_M2_semi(palmer_penguins_missing$flipper_length_mm,
                            palmer_penguins_missing$species,
                            mu_start, sigma_start, pis_start, n_iter = 100)
print("Estimated parameters (pi, mu, and sigma)")
[1] "Estimated parameters (pi, mu, and sigma)"
print(EM_result$pi_hat)
[1] 0.5321285 0.4678715
print(EM_result$mu_hat)
[1] 189.6257 216.6867
print(EM_result$sigma_hat)
[1] 6.028273 7.020769
The semi-supervised model's estimated means are very close to those of the fully supervised model. Its variances, however, are larger than the supervised ones (roughly 6.03² ≈ 36 versus 22 for Adelie, and 7.02² ≈ 49 versus 25 for Gentoo): the unlabelled observations receive soft weights for both classes, which spreads each class's fitted distribution.
💪 Problem 5.3
The dataset palmer_penguins.Rdata, which can be downloaded from the Canvas page of the course, contains the true values for the labels that are missing in palmer_penguins_missing.Rdata. Compute the accuracy of the predictions for these observations using your semi-supervised Gaussian mixture model from Problem 5.2. Compare this to the classification obtained with your supervised Gaussian mixture model from Problem 5.1. Comment on the result.
Tip
When using the supervised Gaussian mixture model in Problem 5.1 for classification, recall that for a test observation \(x_\star\), the prediction is \[\widehat{y}(x_\star)=\arg\max_m \Pr(y=m|x_\star)=\arg\max_m p(x_\star|y=m)\Pr(y=m).\]
The parameters in the Gaussian class conditionals \(p(x_\star|y=m)\) and the marginal class probabilities \(\Pr(y=m)\) are estimated from the labelled training data (i.e. without missing label observations).
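This prediction rule can be sketched in a few lines of base R (the parameter values below are illustrative placeholders, not the estimates from Problem 5.1):

```r
# Classify a test point x_star by the class with the highest pi_m * N(x | mu_m, sigma_m^2).
# Illustrative parameter values, not the fitted estimates.
pis <- c(0.65, 0.35)   # Pr(y = m)
mus <- c(190, 217)     # class means
sigmas <- c(6.5, 6.0)  # class standard deviations

classify <- function(x_star) {
  scores <- pis * dnorm(x_star, mean = mus, sd = sigmas)
  which.max(scores) # 1 = Adelie, 2 = Gentoo
}

classify(185) # -> 1 (Adelie)
classify(220) # -> 2 (Gentoo)
```

The same rule applies to the semi-supervised model; only the parameter estimates differ.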
Confusion Matrix and Statistics
Reference
Prediction Adelie Gentoo
Adelie 33 0
Gentoo 1 15
Accuracy : 0.9796
95% CI : (0.8915, 0.9995)
No Information Rate : 0.6939
P-Value [Acc > NIR] : 3.778e-07
Kappa : 0.9528
Mcnemar's Test P-Value : 1
Sensitivity : 0.9706
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 0.9375
Prevalence : 0.6939
Detection Rate : 0.6735
Detection Prevalence : 0.6735
Balanced Accuracy : 0.9853
'Positive' Class : Adelie
cat("\nConfusion Matrix - Semi-Supervised GMM\n")
Confusion Matrix - Semi-Supervised GMM
print(conf_matrix_semi)
Confusion Matrix and Statistics
Reference
Prediction Adelie Gentoo
Adelie 33 0
Gentoo 1 15
Accuracy : 0.9796
95% CI : (0.8915, 0.9995)
No Information Rate : 0.6939
P-Value [Acc > NIR] : 3.778e-07
Kappa : 0.9528
Mcnemar's Test P-Value : 1
Sensitivity : 0.9706
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 0.9375
Prevalence : 0.6939
Detection Rate : 0.6735
Detection Prevalence : 0.6735
Balanced Accuracy : 0.9853
'Positive' Class : Adelie
To conclude, with the single feature flipper length, the fully supervised and semi-supervised models perform identically on these observations: their confusion matrices are exactly the same, each classifying 48 of the 49 previously missing labels correctly.
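As a sanity check, the reported accuracy follows directly from the counts in the confusion matrix above:

```r
# Accuracy from the confusion matrix: 33 + 15 correct out of 49
conf <- matrix(c(33, 1, 0, 15), nrow = 2,
               dimnames = list(Prediction = c("Adelie", "Gentoo"),
                               Reference = c("Adelie", "Gentoo")))
accuracy <- sum(diag(conf)) / sum(conf)
round(accuracy, 4) # 0.9796
```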
End of the course!
If you made it all the way here without losing too much hair (now you know how I became bald), you should be proud of yourself!
I hope you enjoyed the course, and good luck with the final project. The project will be a lighter version of the computer labs, in particular since you may reuse your computer lab solutions.